SQL Query Composition

This notebook builds on previous notebooks that illustrate relatively simple SQL queries (queries with a single SELECT statement) for retrieving and analyzing BigQuery data. This notebook illustrates how Google BigQuery lets you use nested queries to build complex queries that successively filter, slice, or join results, all within a single query issued to the service. Constructing these complex queries is often easier when done step by step.

This notebook introduces a pattern for developing queries independently, and then stitching them together to construct a complete solution.

Scenario

As in other sample notebooks, this notebook builds on the HTTP logs dataset. The goal is to build a latency distribution curve for requests, which helps visualize the latency pattern of a web server.

Preface on Notebook Content

Although query composition through step-by-step building is intrinsically iterative, this notebook represents a snapshot of the final state. For illustrative purposes, each step of the iteration is broken out into a separate section, but you are unlikely to do the same in a real-world scenario.


In [1]:
import google.datalab.bigquery as bq

Step 1 - Inspecting Latencies

The first step is to get an initial slice of the data we're working with, using a very basic query.


In [4]:
%%bq query
SELECT latency FROM `cloud-datalab-samples.httplogs.logs_20140615` WHERE endpoint = 'Recent' LIMIT 10


Out[4]:
latency
317
197
427
187
75
226
293
331
329
231

(rows: 10, time: 0.2s, cached, job: job_bjmgvGTsQm3-dSOzFOaGgnfEu8g)
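The same query can also be issued through the Python API instead of the %%bq magic. Below is a minimal sketch using the bq.Query class and the same execute/dataframe pattern that appears later in this notebook; the variable name sampling_query is ours:

sampling_query = bq.Query('''
SELECT latency
FROM `cloud-datalab-samples.httplogs.logs_20140615`
WHERE endpoint = 'Recent'
LIMIT 10
''')
# Execute the query and materialize the results as a pandas dataframe.
df = sampling_query.execute(output_options=bq.QueryOutput.dataframe()).result()
print(df)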

In [5]:
%%bq query
SELECT MIN(latency) AS min, MAX(latency) AS max, COUNT(latency) AS count
FROM `cloud-datalab-samples.httplogs.logs_20140615`
WHERE endpoint = 'Recent'


Out[5]:
min  max    count
2    18715  734

(rows: 1, time: 1.7s, 10MB processed, job: job_JGVGCfUp8cObkqHXqHH4L4STETs)

Step 2 - Bucketing Latencies

The latencies are discrete values (in milliseconds) spread over a wide range: 734 requests, ranging from 2 ms to 18715 ms. One way to make this distribution manageable is to bucket the latencies into groups.


In [6]:
%%bq query -n latencyGroups
SELECT SAFE_CAST(latency/25 AS INT64) * 25 AS latency, COUNT(latency) AS requests
FROM `cloud-datalab-samples.httplogs.logs_20140615`
WHERE endpoint = 'Recent'
GROUP BY latency
ORDER BY latency
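
The expression SAFE_CAST(latency / 25 AS INT64) * 25 maps each latency to the nearest multiple of 25 ms, because BigQuery rounds (rather than truncates) when casting FLOAT64 to INT64. A rough Python equivalent, for illustration only (note that Python's round uses banker's rounding at exact halves, whereas BigQuery rounds halves away from zero):

def bucket(latency_ms, width=25):
    # Approximates SAFE_CAST(latency_ms / width AS INT64) * width from the query above.
    return int(round(latency_ms / float(width))) * width

for ms in (317, 197, 187, 2, 18715):
    print(ms, '->', bucket(ms))
# 317 -> 325, 197 -> 200, 187 -> 175, 2 -> 0, 18715 -> 18725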

In [7]:
%bq execute --query latencyGroups


Out[7]:
latency  requests
0        28
25       41
50       6
75       35
100      32
125      17
150      5
175      28
200      100
225      109
250      95
275      52
300      42
325      48
350      38
375      26
400      17
425      4
450      4
475      2
500      1
525      1
550      1
625      1
18725    1

(rows: 25, time: 1.7s, 10MB processed, job: job_4DB-TdKjJA0Tq3ZZo3y2NpwwnPo)

In [8]:
results = latencyGroups.execute(output_options=bq.QueryOutput.dataframe()).result()
results.set_index('latency').plot(kind='bar')


Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fed4167c510>

Step 3 - Latency Distribution

In this step, we want to determine the percentage of requests that fall within each bucket, i.e. requests * 100 / total for each bucket. To compute these percentages, we also need the total request count:


In [9]:
%%bq query -n buckets
SELECT SAFE_CAST(latency / 25 AS INT64) * 25 AS latency, COUNT(latency) AS requests
FROM `cloud-datalab-samples.httplogs.logs_20140615`
WHERE endpoint = 'Recent'
GROUP BY latency
ORDER BY latency

In [10]:
%%bq query -n stats
SELECT COUNT(latency) AS total
FROM `cloud-datalab-samples.httplogs.logs_20140615`
WHERE endpoint = 'Recent'

We can then combine the buckets with the total count to produce percentage values. To do this, we can reference the queries defined above in a new query, using the --subqueries argument:


In [11]:
%%bq query -n latencyDistribution --subqueries buckets stats
SELECT buckets.latency AS latency,
       buckets.requests AS requests,
       buckets.requests * 100 / stats.total AS percent
FROM buckets CROSS JOIN stats
ORDER BY latency

We can inspect the fully composed SQL represented by the query by printing its sql property.


In [12]:
print(latencyDistribution.sql)


WITH buckets AS (SELECT SAFE_CAST(latency / 25 AS INT64) * 25 AS latency, COUNT(latency) AS requests
FROM `cloud-datalab-samples.httplogs.logs_20140615`
WHERE endpoint = 'Recent'
GROUP BY latency
ORDER BY latency),
stats AS (SELECT COUNT(latency) AS total
FROM `cloud-datalab-samples.httplogs.logs_20140615`
WHERE endpoint = 'Recent')
SELECT buckets.latency AS latency,
       buckets.requests AS requests,
       buckets.requests * 100 / stats.total AS percent
FROM buckets CROSS JOIN stats
ORDER BY latency

Or by simply evaluating the query object itself; Datalab makes inspecting objects (such as queries) easy by defining a friendly HTML rendering for them:


In [13]:
latencyDistribution


Out[13]:
WITH buckets AS (SELECT SAFE_CAST(latency / 25 AS INT64) * 25 AS latency, COUNT(latency) AS requests
FROM `cloud-datalab-samples.httplogs.logs_20140615`
WHERE endpoint = 'Recent'
GROUP BY latency
ORDER BY latency),
stats AS (SELECT COUNT(latency) AS total
FROM `cloud-datalab-samples.httplogs.logs_20140615`
WHERE endpoint = 'Recent')
SELECT buckets.latency AS latency,
       buckets.requests AS requests,
       buckets.requests * 100 / stats.total AS percent
FROM buckets CROSS JOIN stats
ORDER BY latency

Notice that the buckets and stats queries have been inlined as named subqueries in a WITH clause and combined with a CROSS JOIN. Because stats produces exactly one row, the join simply attaches the total to every bucket row without multiplying rows. The resulting SQL represents a single query that can be executed in BigQuery.
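
As a quick sanity check of the percent column, we can reproduce one value from counts we already have (734 total requests from the stats output, and 28 requests in the 0 ms bucket):

total = 734      # from the stats query above
requests = 28    # requests in the 0 ms bucket
print(requests * 100.0 / total)   # 3.8147..., matching the first row of the output below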


In [14]:
%bq execute -q latencyDistribution


Out[14]:
latency  requests  percent
0        28        3.81471389646
25       41        5.58583106267
50       6         0.817438692098
75       35        4.76839237057
100      32        4.35967302452
125      17        2.31607629428
150      5         0.681198910082
175      28        3.81471389646
200      100       13.6239782016
225      109       14.8501362398
250      95        12.9427792916
275      52        7.08446866485
300      42        5.72207084469
325      48        6.53950953678
350      38        5.17711171662
375      26        3.54223433243
400      17        2.31607629428
425      4         0.544959128065
450      4         0.544959128065
475      2         0.272479564033
500      1         0.136239782016
525      1         0.136239782016
550      1         0.136239782016
625      1         0.136239782016
18725    1         0.136239782016

(rows: 25, time: 2.3s, 10MB processed, job: job_D98mC9AfoDmJ_T04MyWuK3uzwLM)

We can also execute the query, retrieve its results as a dataframe, and plot them:


In [15]:
job = latencyDistribution.execute(output_options=bq.QueryOutput.dataframe())
job.result().drop(['requests'], axis = 1).set_index('latency').plot(logx = True)


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fed101c8f10>
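
To persist the chart rather than only rendering it inline, the axes object returned by DataFrame.plot exposes the underlying matplotlib figure; this is standard pandas and matplotlib behavior, not specific to Datalab (the output filename here is just an example):

# Save the distribution plot to a PNG file alongside the notebook.
ax = job.result().drop(['requests'], axis=1).set_index('latency').plot(logx=True)
ax.get_figure().savefig('latency_distribution.png')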

Looking Ahead

Composing queries one step at a time and validating the results along the way (by either executing or sampling intermediate queries) allows you to build sophisticated queries while containing the complexity.

Look at some of the BigQuery samples for more examples.